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Abstract: The robustification of pattern recognition techniques has been the subject of intense research 
in recent years. Despite the multiplicity of papers on the subject, very few articles have deeply explored 
the topic of robust classification in the high dimension low sample size context. In this work, we explore 
and compare the predictive performances of robust classification techniques with a special concentration 
on robust discriminant analysis and robust PC A applied to a wide variety of large p small n data sets. We 
also explore the performance of random forest by way of comparing and contrasting the differences single 
model methods and ensemble methods in this context. Our work reveals that Random Forest, although not 
inherently designed to be robust to outliers, substantially outperforms the existing techniques specifically 
designed to achieve robustness. Indeed, random forest emerges as the best predictively on both real life 
and simulated data. 
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1. Introduction 

We are given a data set 3> = {(xi, yi), • ■ • , (x„, y„)} where x,; G SC Cl pxl and yi G y. We specifically focus 
on the challenging multicategorical classification scenario involving the so-called high dimensional low sample 
size (HDLSS) datasets, that is, n <gC p or more precisely p much larger than n, and yi G y = {1,2, • • • , G}, 
where G represents the number of groups/classes to which a p-tuple x from the input space SC may belong. We 
consider the task of building the best predictively optimal estimator /(•) of the underlying true classifier /(•) 
of the data. Throughout this paper, we shall use the average test error AVTE(-), as our measure of predictive 
performance, namely 
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where /,.(•) is the r-th realization of the estimator /(•) built using the training portion of the split of 3> into 
training set and test set, and (xj r \yj r ^ is the i-th observation from the test set at the r-th random replication 
of the split of 3>. Here, we use the ubiquitous zero-one loss function defined by 


Z{y , W ,/r(x! r) )) = 1 { (r)^£ 


rn = 


1 ify < r) ^/ P (xj r) ) 
0 otherwise. 


( 1 . 2 ) 


The pattern recognition literature is filled with techniques created and developed to solve precisely this prob¬ 
lem. Amongst others, logistic regression, discriminant analysis, k —nearest neighbors, classification trees, support 
vector machine, random forests, boosted trees, relevance vector classifiers and gaussian process classifiers, just 
to name a few. Most of the literature in classification deals with data scenarios where the number n of obser¬ 
vations/instances is much larger than the dimensionality p of the input space SC. As stated earlier, this paper 
considers data sets of a very special kind, namely the so-called High Dimension Low Sample Size (HDLSS) 
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datasets, also known as large p small n data, since for this type of data, n p, i.e., n is much less than p. 
Data sets of this type are very common these days especially from the fields of study involving microarray gene 
expression used in diagnosing and helping cure diseases such as cancer. As a matter fact, we consider six such 
data sets in this paper containing information about various forms of cancer, namely prostate, lymphoma, lung, 
colon, leukemia, brain. Traditional classification techniques like logistic regression, discriminant analysis and 
k— nearest neighbors fail miserably on this kind of data, mainly due to the fact that the condition n <SgC p leads 
to illposedness, and thereby the inability of those methods to even have a solution. In the case of k— nearest 
neighbors for instance, the n p condition leads to a severe case of the curse of dimensionality, since the concept 
of neighbor then becomes loose and ill-defined when the dimension of the input space SG is far larger than the 
number of observations available Kondo et al. (2012). Several approaches have been proposed to achieve optimal 
classification in this HDLSS context. One of the earliest is regularized discriminant analysis (RDA) proposed 
and extensively developed by Friedman and Tukey (1974a), recently used by authors like Guo et al. (2006) in 
for the classification of microarray gene expression data. There is a vast literature on regularized discriminant 
analysis and regularized logistic regression, with a good number of the contributions dedicated to handling 
classification problems when n p. It is important to note that it is quite typical to have contamination in the 
data whenever the dimensionality of the input space gets ever larger. The presence of outliers in the data is hard 
enough in low dimensional spaces (n p with p relatively small), let alone in extremely high dimensional spaces 
where one now has to contend with both ill-posedness and outliers. Indeed, these situations trigger the need for 
both regularization (to deal with high dimensional ill posedness) and robustification to circumvent the ill-effect 
of outliers Vanden Branden and Hubert (2005). In the context of n < p, there is a relatively large literature on 
robust discriminant analysis with many of the contributions based on various approaches to robust estimation of 
both location and scatter Filzmoser et al. (2009), Filzmoser and Todorov (2011), Filzmoser and Todorov (2013), 
Pires (2003), Pires (2010), Todorov and Pires (2007), Todorov and Filzmoser (2009). Unfortunately, apart from 
Vanden Branden and Hubert (2005), Pires (2010), there has not been much work on robust discrimination when 
n is much less than p. In fact, we will reveal in our computational section that the traditional robust approach 
based on Minimum Covariance Determinant (MCD) estimation of the covariance structure fails miserably in the 
HDLSS context. The two approaches presented and explored by Vanden Branden and Hubert (2005) and Pires 
(2010) appear to still be in the very early stages of development. In our experimentations, we noticed that those 
techniques tend to work when n/p is close to 10 —2 , but they all fail or struggle if the ratio n/p gets smaller. In 
this paper, we explore both real life data - mainly microarray gene expression cancer data - and simulated data, 
and we reveal patterns exhibited by the average test error as a function n, p and G. In the context of simulated 
data, we also consider the impact of the contamination rate e, the magnitude k of contamination of the scatter 
matrix, and the level p of correlation among the predictor variables. Throughout our simulations, we use the 
same value for the contamination of the location. The rest of this paper is organized as follows. In section two, 
we present a brief summary of discriminant analysis with an emphasis on where the need for robustification and 
regularization arises. In section three, we discuss some of the most commonly used techniques of robustification, 
highlighting some of the limitations and merits of each method. In section four, we present the computational 
comparison of the techniques on real life data. In section five, we present a large simulation study, featuring 
various choices of the sample size n , dimensionality p , number of classes G, contamination rate e and contami¬ 
nation size k and correlation among the predictive variables p. We also highlight how various scenarios of these 
choices impact the average prediction error over R replications. In the section six, we present our conclusion 
and discussion along with a brief introduction to our future work dedicated to the regularized version of robust 
discriminant analysis in the HDLSS context. 

2. Elements of Discriminant Analysis 

Discriminant analysis is arguably one of the oldest and most commonly used approaches to pattern recognition. 
The main idea is that given G distinct groups one defines G different functions dfc(-), for k = 1, • • • , G, such 
that given a data matrix X = [x^, xj, ■ • • , x,[] whose ith row x,- 1 = (x^i, , ■ ■ ■ , x, p ) for which class label y; 

is desired, 

y i = dass(xj) = argmax{? fc (x,)} (2.1) 

fc=1,— ,G 

where <5fe(xj) is the estimator of <Jfc(xj). In other words, discriminant analysis estimates the class of a new object 
as the label whose discriminant functions yields the largest value given x,. Now, when the density function of 
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Xj given class k is known to be multivariate Gaussian, we have quadratic discrimination. If in addition all the 
classes have the same covariance matrix, i.e. {X\k) ~ MVN (fj, k , E), we are in the presence of linear discriminant 
analysis (LDA). In other words, with LDA, each sample group has its class mean fi k (k = 1,2, • • • , G) but shares 
the sample covariance matrix E with the other groups. If 7 r k = Pr[Y = k] denotes the prior probability of class 
membership, then the LDA discriminant function is given by 


4(x) 



/x fc ) T E 1 (x-/x fc ) + logTTfe. 


In practice, 5 k (-) is estimated from the data (xi,yi), • ■ •, (x„,y n ) by its empirical counterpart 


( 2 . 2 ) 
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where iT k = — = — is the observed proportion of group/class k observations and 
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the empirical (sample) mean in class k. both defined using n k = X) w ik, the number 
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k and the observed indicator of class membership given by 
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of observations from class 
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Finally, the estimate E of the common covariance E is often the pooled covariance given by 


J2 ( n k - 1 )S k 


E = 
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where S k is the sample (empirical) variance-covariance matrix of the kth class, namely 


S k = -- V w ik (x - /I fe ) T (x - /I fc ). 

n k - 1 ^ 


( 2 . 6 ) 


It turns out that both the estimated location fl k and the estimated scatter matrix E are sensitive to outliers, 
making the estimated discriminant function non robust under contamination. In the context of prediction, non¬ 
robustness leads to poor predictive performances due the fact the presence of outliers in the explanatory variables 
causes the vital components of the discriminant function to be biased. We therefore need robust methods to 
estimate both location and scatter parameters in order to decrease the prediction error. When in addition we 
have p much larger than n, we encounter the extra problem of non-invertibility of E. Typically, one can solve 
this problem by selecting few subset of variables. However, a more general approach deals with the problem by 
regularizing E using E = E + A I p with A S (0, oo), or a convex and more general version of it where 

E = (1 — a)E + —trace(E )I p (2-7) 

with a C (0,1). Some of the earliest work on regularized discriminant analysis include the seminar paper by 
Friedman and Tukey (1974a), and later applications by Guo et al. (2006), just to name a few. Many authors 
have contributed extensively in the area of regularized discriminant analysis. Clearly, in the presence of both 
outliers and high dimensionality, one needs to both robustify and then regularize. In the subsequent section, we 
mainly focus on the kind of robustification that deals with contaminated high dimensionality through a suitable 
combination robust PCA and projection pursuit Pires (2010), Vanden Branden and Hubert (2005). 
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3. Robust Estimation Methods for Linear Discriminant Analysis 


In section two, one the challenges of discriminant analysis came from the fact that in the presence of out¬ 
liers, the location and scatter estimates are not reliable because they are not robust. Many authors have 
contributed a wide variety of approaches all aimed at at robustifying LDA. One of the earliest approaches 
used to address this problem is a technique known as Minimum Covariance Determinant (MCD) introduced 
and developed by Rousseeuw (1984), Rousseeuw (1985). The literature on robust discriminant analysis has 
blossomed recently, with many papers studying and exploring various extensions of MCD. In this paper, we 
will be comparing the predictive performances of extensions developed by (Croux and Dehon (2001)), then the 
one explored by He and Fung (2000) and Hubert and Van Driessen (2004). We will also look into the predic¬ 
tive performances of MCD extensions proposed by (Hubert and Van Driessen (2004)), and the ones developed 
by Todorov and Pires (2007), Hawkins and McLachlan (1997). Another paper extending the work on MCD is 
contributed by Hubert and Debruyne (2010). It is important to note that all the above mentioned variations 
on the MCD theme have been implemented in R with packages like rrcov and rrcovHD readily available for 
immediate installation and use. Our goal in this paper is to consider different scenarios of contamination in 
high dimensional spaces, and then compare the performances of existing techniques, with the finality of estab¬ 
lishing the conditions under which each one of the techniques performs best. Rousseeuw (1984)’s original MCD 
estimator is quite intuitive and can be briefly described as follows: given n observations xi,X 2 ,-- - , x„ taken 
from a p-dimensional space J Cl p , with true location vector //, and true scatter matrix S, find the subset of 
h observations out of n such that the corresponding sample (estimated) covariance matrix yields the smallest 
determinant. In other words, we must have 


det(S( 7 ( MCD ))) 


min 

7 £{ 0 , 1 } 


{det(S( 7 ))| 


where 7 G { 0 , 1 }™ simply represents the indicator vector such that 7 $ = 1 if observation i is among the h chosen, 
and 7 j = 0 if it is not. Obviously, the final indicator vector 7 ( MCD ) chosen by the MCD method is such that 
length^* 1 ™)) = | 7 ( MCD )| = h. Also, we use the notation 0(-y) to denote the estimator of 0 based on only the 
observations selected by the indicator vector 7 . Essentially, the MCD estimator of the location parameter p, is 
defined by the mean of that subset 7 ( MCD ) and the MCD estimator of the scatter matrix parameter S is defined 
by the covariance of that subset 7 l - MCD h More specifically, 

A*mcd = / 4 .( 7 < ' MCD ’ ) ) and Xmcd = S( 7 < - MCD ^). 

In practice, Smcd is chosen in such a way that it is a multiple of its covariance matrix. The multiplicative factor 
is chosen in such a way that Smcd is consistent at the multivariate normal model and unbiased for small samples 
(Pison et al. (2002)). Now, the MCD approach requires that ^ < h < n, with h = [(n + p + l)/2] yielding the 
maximal breakdown point. Central to MCD is the fact that h must be determined or set. In fact, it should be 
noted that the MCD estimator cannot be computed when p > h, since such a scenario would mean having a 
singular covariance matrix for any h—subset. It turns out that the whole MCD machinery needs n > 2p in order 
to function at all. For our high dimension low sample size problems for which p n, it is obvious that the 
basic formulation of MCD does not work. We discuss a little later the extension of MCD known as regularized 
MCD whereby the estimates of interest are obtained in their regularized version. Even within the satisfaction of 
the n > 2p requirement, MCD is essentially very computationally intensive for the simple reason that the need 
to select a subset combinatorially to optimize a criterion requires a number of computing operations that can 
explode for even small sample sizes. Many faster versions of the basic MCD algorithm have been suggested, led 
by Rousseeuw and Driessen (1998). As we shall see later in our computations, different variants of MCD lead 
to sometimes drastically different predictive performances. In the context of discriminant analysis, one of the 
obvious limitations of MCD lies in the fact that it trims all the classes/groups equally. Such an equal treatment 
of all the groups is potentially inefficient in the presence of uncontaminated groups. Many other drawbacks 
of the basic MCD approach to discriminant analysis have been scrutinized and addressed by authors such as 
He and Fung (2000), Hubert and Van Driessen (2004), Christmann and Hable (2013), Croux and Dehon (2001). 
Each of these variants was proposed in order to address/solve a perceived limitation/drawback of the basic MCD 
approach. Unfortunately, in turns out all these variants of MCD only work when p is less than n. In fact most of 
them require n to be at least greater than 2 p. In order to deal with n less than p situations one had to abandon 
MCD in its present form. As a matter of fact, we are currently working on an extension of the MCD approach 
that combines robustification and regularization to address HDLSS situation with n/p arbitrarily very small. In 
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this paper however, we explore two non-MCD based techniques, namely robust SIMCA (described later) and 
projection pursuit (PP) discriminant analysis. As we will see in the computational section, PP will proved to 
be quite flexible but unfortunately fail when n/p gets to small (less than to 10~ 2 ). We will see later that PP 
yields relatively poor predictive performances when the number of classes is greater than 2 and/or the intrinsic 
dimensionality of the input space is inherently high. 

SIMCA approach to High Dimensional Robust Classification: Soft Independent Modelling of Class Analogies 
(SIMCA) was introduced by Wold (1976). Bicciato et al. (2003) explain that SIMCA ability to classify high di¬ 
mensional data comes from the fact that it is based on a clever adaptation of principal component analysis (PCA). 
Thanks to the supervised nature of discriminant analysis (class membership known), SIMCA proceeds by per¬ 
forming principal component analysis in each of the G classes separately. Essentially, SIMCA can be summarized 
as an approach that combines robust PCA within each group based on robust covariance estimation to achieve 
good predictive performances in classification. More details on SIMCA can be found in Hubert and Van Driessen 
(2004), Hubert and Engelen (2004) Bicciato et al. (2003) and Vanden Branden and Hubert (2005). SIMCA has 
been widely applied to areas as diverse as image analysis, microarray gene expression classification, and many 
other fields where data exists with n much less than p. An implementation of Robust SIMCA is provided through 
the R package rrcovHD, and will be used in our comparison of predictive performances of high dimensional ro¬ 
bust classifiers. 

Projection Pursuit Approach for Robust Linear Discriminant Analysis: From Wikipedia, Projection Pursuit 
(PP) is a type of statistical technique which involves finding the most ”interesting” possible projections in mul¬ 
tidimensional data. Often, projections which deviate more from a normal distribution are considered to be more 
interesting. As each projection is found, the data are reduced by removing the component along that projection, 
and the process is repeated to find new projections; this is the ’’pursuit” aspect that motivated the technique 
known as matching pursuit. The idea of PP is to locate the projection or projections from high-dimensional 
space to low-dimensional space that reveal the most details about the structure of the data set. Once an in¬ 
teresting set of projections has been found, existing structures (clusters, surfaces, etc.) can be extracted and 
analyzed separately. PP has been widely use for blind source separation, so it is very important in independent 
component analysis. PP seek one projection at a time such that the extracted signal is as non-Gaussian as pos¬ 
sible. From the seminal article Friedman and Tukey (1974b), many authors like Friedman and Stuetzle (1981), 
Friedman (1987) have developed applications and extensions of PP to wide variety of statistical problems. 
There have also been many theoretical justifications and discussions of the strengths and appeal of PP Huber 
(1985), Hall (1990). PP has also been used in discriminant analysis and outlier detectionPan et al. (2000) and 
Polzehl and Forschungsgemeinschaft (1993). Pires (2010) presents one of the most recent developments on PP 
for robust discriminant analysis in high dimensional spaces. Her work builds first on the original PP idea as 
presented and developed by Friedman and Tukey (1974b) and Huber (1985), and combines robustness strategies 
and the foundational idea of PP presented in the above definition, to achieve robust linear discriminant analysis 
in high dimensional spaces. Four of the seven techniques compared in this paper are based on the PP approach 
to robustness linear discriminant analysis in high dimension spaces. As we’ll see later though, it will turn out 
that PP techniques will fail - somewhat catastrophically at times - when the intrinsic dimensionality of the data 
is high. This is unsurprising, since the very idea of PP presupposes the existence of a lower dimensional space 
as the true/intrinsic basis of the data. 

Ensemble Learning Approach to Robust Classification: It is often common in massive data that selecting a 
single model does not lead the optimal prediction. For instance, in the presence of multicollinearity which is 
almost inevitable when p is very large, function estimators are typically unstable, as they tend to exhibit rather 
estimation large variances. This issue of inflated variance gets even more amplified in the presence of data 
contamination. It makes sense that when data contains outliers, they will cause the variance of estimators to 
increase. The now popular bootstrap aggregating also referred to as bagging offers one way to reduce the variance 
of the estimator by creating an aggregation of bootstrapped versions of the base estimator. This is an example 
of ensemble learning, with the aggregation/combination formed from equally weighted base learners. Consider 
regression for instance and let g ( ' b ' ) (-) be the 6th bootstrap replication of the estimated base regression function 
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g(-). Then the bagged version /( ba s6 in s)(.) 0 f the estimator /(•) is given by 


J(bag g ing) (x) = _^g(b) (x) ^ 


6=1 


If the base learner is a multiple linear regression model estimator g(x) = fio + x T (3, then the 6th bootstrapped 
replicate is g^ b ^(x) = /3^ + x T /3^, and the bagged version /<. ba ss in s)(.) of the estimator /(•) is 


J{ bagging) ^ 



6=1 



In this paper, our focus is on classification in high dimensional space where p is larger than n and the data also 
contains some outliers. Now, let’s consider a multi-class classification task as defined much earlier with labels y 
coming from LV = {1, 2, • • ■ , G} and predictor variables x = (xi, X 2 , • • • , x p ) T coming from ap-dimensional space 
LAL. Let g^(-) be the 6th bootstrap replication of the estimated base classifier g(-), such that (y)^ = g^(x) is 
the 6th bootstrap estimated class of x. The estimated response by bagging is obtained using the majority vote 
rule, which means the most frequent label throughout the B bootstrap replications. Namely, 

/^bagging) = Most frequent label in C^(x), 


where 

C (B) (x) = |g ( 1 ) (x),g ( 2 ) (x),-" ,g (B) (x)j. 

Succinctly, we can write the bagged estimated label of x as 


b agging) (x ) = arg max {freq e(B)(x) (y)} = argmax (l {y=f(b)(x)} ) | . 

It must be emphasized that in general, ensembles do not assign equal weights to base learners in the aggregation. 1 
For our purposes in this paper, it is crucial to address the issue of how bagging predictors can help achieve 
robust classification in high dimensional spaces. It turns out that by combining bagging with subsetting of the 
input space, we end up with the so-called Random Subspace Learning (Attribute Bagging) which, as we’ll argue 
later, yields very good predictive performances on high dimensional data contaminated with outliers. Consider 
the training set LA = {z, = (xj,yi = 1, • • • , n}, where x^ = (xq, • • • ,Xi p ) and y^ £ LAA are realizations of 
two random variables X and Y respectively. Suppose our goal is to use a total of B base learners to build an 
ensemble estimator / ba ss in g(.) of the underlying function /(•). Random Subspace Learning (Attribute Bagging) 
proceeds very much like bagging, with the added crucial step consisting of selecting a subset of the variables 
from the input space for training rather than building each base learners using all the p original variables. 

• Randomly draw the number d < p of variables to consider 

• Draw without replacement the indices of d variables 

• Build the d-dimensional model 

This step is the main ingredient for variable importance estimation and also has the benefit of circumventing 
the limitation of bagging due to correlatedness of the trees making up the ensemble. 

Let Zi £ LA be a random pair in the original training set LA of size n, and consider the bootstrapped sample 
LA of size n also, generated by sampling with replacement from LA. Let Prjz^ £ LA I A' 1 ] represent the proportion 
of observations from LA present in LA( b \ Then it can be shown that Pr[zj £ LA^} = 1 — (l — L) As a result, if 
Pr[zi ^ LA = Pr[O n ] denotes the proportion of observations from LA not present in LA( b \ then 

Pr[z s i LA W] = (l - = Pr[O n ]. (3.1) 

lr The general formulation in the context of regression for instance is j( ba gg in g) ( x ) = Ylb=i a ^S^( x )> where the aggregation is 
often convex, i.e. Ylb =l = 1- 
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Algorithm 1 Random Subspace Learning for Model Aggregation 


. Z'B * 1 } 


procedure RandomForest(S) 

Choose a base learner g(-) 

Choose an estimation method 
for b = 1 to B do 

Draw with replacement from @ a bootstrap sample = {z^, • • • 

Draw without replacement from {1, 2, • • • , p} a subset {j[ b \ • • • , j^} of d variables 
Drop unselected variables from so that i^ub ^ dimensional 
Build the 6th base learner based on i^ub 

end for 

Use the ensemble |g( & )(-), &=!,••• , B | to form the estimator 


> The Random Forest Algorithm for B trees 

> e.g.: Trees 
> e.g.: Recursive Partitioning 


H rf ) _ ar g max 

ye& 


{ fre qc(B) W (y)} = ar e y m |* 


B 

£< 

b= l 


L {y=g( b )( X )} 


11: end procedure 


It turns out that Pr[0„] —» e -1 « 0.37 as n — > oo. Which means that roughly about one third of the training set 
is not used when building the &th bootstrapped based learner. In the context of data contaminated with outliers, 
each observation - including outlier - has an asymptotic probability of e -1 of not affecting the base learner at 
hand. It is therefore reasonable to deduce that by averaging this exclusion of outliers over many replications (B 
base learners, with B typically large), one can achieve a robust classifier through bootstrap aggregation. Our 
conjecture in this regard is that through its bootstrapping mechanism, each outlier has a probability of e -1 of 
not affecting the base learner. 

4. Computational Comparison of Predictive Performances 

We now consider comparing the predictive performances of the techniques described earlier on real life data. 
Among other things, we present the apparent (training) error and the true (test) error which in this case is 
more precisely average test error over R replications as defined in (1.1). 

avte (/) = ^e{^E *(y.- r) > fM r) ))\ ■ 

r=l l z=l ) 

Throughout this paper, each replication randomly assigns 2/3 of the data to the training set and 1/3 to the 
test set. We do not consider a validation set because none of the techniques is based on a tuning parameter. 
We use R = 200 replications. We analyze 7 different datasets, six of which are high dimension low sample size 
(HDLSS) microarray gene expression datasets. For clarity and completeness, we present both the tables and the 
plots depicting the predictive performances of the methods explored and analyzed. 

4-1- Description of the datasets 

(i) Diabetes data: Our first data set deals with diabetes. It contains 145 observations, 3 variables and three 
classes. This is obviously not a high dimensional dataset, but we use it here to reveal the stark differences 
in performance between methods when one switches from large n small p to large p small n. This dataset is 
available in the R package called mclust contributed by Reaven and Miller (1979). (ii) Ceramic pottery data: 
This pottery data set was analysed by Stern and Descoeudres (1977). Other authors have used it to test the 
robustness of their methods, namely Cooper and Weekes (1983) and later Pires (2010). Please note that these 
first two datasets are qualitatively different from all the other datasets explored in this paper. Indeed, all the 
other 5 remaining datasets we explore here have in common the fact that they are all microarray gene expression 
data sets, (in) Prostate cancer data: This data set comes from microarray gene expression profiles/levels 
on prostate cancer, and is a subset of a much larger data set from a study by Stephenson et al. (2005). This 
dataset has 37 samples classified as recurrent and 42 as non-recurrent primary prostate tumor, (iv) Lymphoma 
data: The following data set deals with Lymphoma and contains 180 observations and 661 variables, (v) Lung 
cancer data: This dataset on lung cancer s just one of many existing lung cancer datasets. The version we 
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explore in this paper contains 197 observations and 1000 variables, (vi) Colon cancer data: From Alon et al. 
(1999), it contains 62 observations on subjects classified into two groups (Gl: subjects with colon cancer, with 
40 observations; G2: healthy subjects, with 22 observations) and measured on 2000 variables (gene expression 
levels). The aim is to predict, as accurately as possible, the disease status from the gene expression levels. This 
is a well known data set in the modern classification literature (e.g., References from the paper) and the original 
version is available in the colonCA R package from Bioconductor. The raw data is not normalized/preprocessed, 
which may lead to very bad classification results. Therefore a simple normalization procedure was applied: the 
data were log-transformed and after that each row was individually centered using its median, (vii) Leukemia 
data: In this Leukemia data set, there are 3571 variables (features), 72 samples (Golub (1999)). (viii) Brain 
cancer data: The last data set considered in this paper is a brain cancer dataset (Pomeroy (2002)). The total 
number of patients in this case is n = 42, each represented by p = 5597 microarray gene expression features, 
covering 5 different types of brain cancer. Using the packages R packages rrcov and rrcovHD, we first computed 
the apparent error for each of the methods on all the datasets mentioned above. 


4.2. Comparisons of Methods on Real Data Using the Apparent Error 


Although the core of our work in this paper is focused on predictive performance as stated right from our 
introduction, we devote this subsection to comparisons based on the apparent (training) error, following from 
Todorov and Pires (2007) and Pires (2010) who performed similar comparisons. At the very least, the compu¬ 
tation (or attempt thereof) of the apparent misclassification rates gives a rough idea of how well the method 
might perform predictively, when they can be applied at all. 


n 

P 

= 48.33 

Diabetes 

5TO = 0158 

Prostate 

= 0.272 

661 

Lymphoma 

197 — o 197 
1000 

Lung 

rono = 0 031 
Colon 

357T = 0020 

Leukemia 

5#f = °°° 75 

Brain 

Classic 

13.10 

NA 

NA 

NA 

NA 

NA 

NA 

Linda 

10.35 

NA 

NA 

NA 

NA 

NA 

NA 

pp 

27.58 

29.11 

55.00 

21.83 

11.29 

5.55 

52.38 

SIMCA 

11.72 

35.44 

9.44 

5.58 

9.68 

12.50 

NA 


Table 1 

Apparent misclassification rates for the classical and robust estimators under different scenarios 


As can be seen on Table (1), Linda (which is an R implementation of the MCD approach) and classic LDA only 
work when n/p is greater than 1, in this case on the diabetes data. PP and SIMCA do handle HDLSS ( n/p < 1), 
with the exception that SIMCA fails when n/p < 10 -2 . When they can both be applied, there is no clear winner 
between PP and SIMCA: on some datasets, PP outperforms SIMCA, but on other datasets, it is the other way 
round. This seems to indicate that besides the impact of n/p, there is also the effect of the internal geometry of 
the data. 


4-3. Comparisons of Methods on Real Data Using the Average Test Error 

In the interest of comparing the predictive optimality of the techniques explored, we now present the perfor¬ 
mances of 4 different variants of PP discrimination, against robust SIMCA, RF and Diagonal Discrimination 
Analysis (DDA). Here, we use R = 200 replications, and some of the results are summarized in Table (2) and 
Figure (1). As revealed in Table (2), one of the most immediate remark that emerges from our computations 
is the fact that SIMCA comes out as the worst in predictive performance 3 times, which is far more than any 
other method. SIMCA also never comes out as best. We actually provide greater details about this mediocre 
aspect of SIMCA in the simulated data section later. Another noteworthy aspect of these computations is the 
fact that DDA appears to perform very well, specifically emerging as the best 3 times (which is more times 
than any other technique). Given the fact DDA operates under the strong assumptions of uncorrelatedness of 
the predictor variables, we are let to infer that most of the datasets, especially those on which DDA has the 
best predictive performances, are inherently very high dimensional. In fact, this line of thought is somewhat 
supported by the fact overall, PP, which relies on the existence of a lower dimensional projection of the data, is 
the most unstable of the all the methods involved as shown clearly by Figure (1) where PP depicts huge spikes 
corresponding to large prediction errors. As depicted on the right panel of Figure (1), the lymphona data causes 
PP to fail miserably, probably because the intrinsic dimensionality of this data is so high that the projections 
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fail to find lower dimensional representations. RF on the other hand provides what we perceive as the most 
stable of all the performances: it can be seen in Table (2) and Figure (1) that RF never comes last, and is 
usually best or second best, with no instability spikes. As a matter of fact, the right panel of Figure (1) shows 
the superior performance of RF and DDA, and relatively good performance of SIMCA on the lymphoma data, 
whereas all the PP variants fail catastrophically. 



PP-class 

PP-huber 

PP-mad 

PP-sest 

RSIMCA 

RF 

DDA 

diabetes 

28.791 

27.21 

28.07 

28.44 

14.96+ 

3.54* 

19.32 


(5.41) 

(5.68) 

(5.57) 

(5.40) 

(4.95) 

(2.53) 

(4.99) 

ceramic 

8.44* 

9.44+ 

12.50 

11.06 

23.001 

16.00 

9.72 


(9.07) 

(8.32) 

(10.14) 

(9.35) 

(13.62) 

(10.13) 

(9.96) 

lymphoma 

58.96 

58.14 

59.03 

59.721 

20.22 

8.98+ 

8.64* 


(6.02) 

(5.95) 

(6.36) 

(5.63) 

(7.29) 

(3.98) 

(3.55) 

lung 

22.201 

22.02 

22.15 

21.90 

8.12 

5.23+ 

2.69* 


(4.18) 

(4.18) 

(4.28) 

(4.09) 

(3.37) 

(2.53) 

(1.69) 

colon-1 

17.90 

18.29 

18.02+ 

17.52 

23.761 

18.76 

15.55* 


(7.73) 

(7.99) 

(7.68) 

(8.07) 

(10.34) 

(8.96) 

(6.78) 

colon-2 

23.38 

23.05 

21.67+ 

21.19* 

25.88 

21.71 

31.861 


(8.46) 

(8.47) 

(9.05) 

(9.71) 

(11.82) 

(9.19) 

(15.27) 

leukemia 

5.58 

5.21 

5.13 

4.85 

15.561 

3.04* 

3.21+ 


(5.04) 

(4.19) 

(4.90) 

(4.65) 

(12.47) 

(4.04) 

(3.66) 


Table 2 

Average test error along with the corresponding standard deviation in parentheses. The star (*) is used to indicate the method 
with the best predictive performance, while the double dagger (%) indicates the worst predictive performance. The dagger (\) 
identifies the second best. The absence of prostate and brain is due to the fact that many methods explored could not even handle 

them. SIMCA for instance could not handle the brain cancer data set. 



Fig 1. (left) Average test error as a function of n/p. The data sets appear on the x axis in decreasing order of n/p. The first 
one(diabetes) has n/p = 145/3 and the last one (leukemia) has n/p = 72/3571; (right) Average test error on the lymphoma data 
set for which n/p = 180/661, and the number of classes is g = 3. These box plots compare the predictive performances of all the 7 
methods considered. 


4-4- Comparison of Predictive Performances on Simulated Data 

Based on the computations performed earlier on real life F1DLSS microarray gene expression datasets, some 
patterns began to emerge, among which the overall stability and relatively strong predictive performance of RF, 
a method not structurally aimed at robustness. We also noticed that PP is a very unstable method, typically 
producing the worst performances on most data sets. Despite all these initial findings, we still do not have a 
general characterization of which aspects of the data drive the performances of each method. In this section, we 
use a thorough simulation study with various aspects of data characteristics, with the finality of determination 
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what makes methods work well. Throughout this paper, all our simulated data will be generated from the 
multivariate Gaussian whose density will be denoted by </> p (x; ft, E), and defined by 


(x;/x,E) = 




ex P { U) E (x-M) ? ■ 


(4.1) 


Under the e-contamination regime, the class conditional density of A" in class k is given by 


p(x| m, E, k, e, t), k) = (1 - e)^ p (x; p k , E) + e<j) p (x; p k + rj, kE), 


(4.2) 


where rj represents the contamination of the location, while k captures the level of contamination of the scatter 
matrix. Furthermore, in order to study the effect of the correlation pattern, we simulate the data using a 
covariance matrix E parameterized by r and p and defined by 


/ 1 P . P\ 

pi p ••• p 


E = E(t, p) = t 


= r[{l - p)I p + pi pip] 


P P 1 P 

\ P . P 1 / 


where I p is the p-dimensional identity matrix, while l p is p-dimensional vector of ones. In fact, a more general 
covariance matrix to help better gauge the effect of correlation on the robust classification methods would be 
tE, where E = (cr,j), with = p\ l ~ 3 'I, that is, 


( 1 

P 


E = E(r, p) = r 


P 

1 


O p - 2 pP - 1 
f)P~ 2 


P P 2 ' . p 1 p 

V pp - 1 pp - 2 ■■■ p i / 


For simplicity however, we use the first E with r = 1 throughout this paper. For the remaining parameters, we 
use e G {0, 0.05,0.15}, k G (9, 25,100}, G € (2, 3} and p G {0,0.25,0.75} and p G {10,100,1000}. As the vector 
of e values shows, we consider 3 different levels of contamination, namely no contamination, mild contamination 
and strong contamination. 


Uncontaminated Data 


We first consider the performances of the techniques under an uncontaminated regime, i.e. e = 0. Our first 
simulation on under this regime looks at combination where the number of classes is G = 2 and than investigate 
the effect of p and p (input space dimension). As the plots all reveal, PP appears to perform very well (usually 
outperforming all the other methods) whenever intrinsic dimensionality of the data is low (captured by p very 
high) and the number of classes is 2 (Binary classification task). 



PP-class 

PP-huber 

PP-mad 

PP-sest 

RSimca 

RF 

DDA 

10 

58.26 

59.22 

61.19 

61.26 

41.11 

45.11 

43.26 

100 

54.11 

50.89 

53.33 

55.63 

40.44 

42.00 

43.44 

1000 

44.44 

35.26 

35.74 

35.96 

30.81 

28.70 

35.89 

2000 

44.17 

34.00 

34.28 

34.67 

30.89 

32.44 

38.50 


Table 3 

Average test error on the uncontaminated simulated data with g = 3 and p = 0.75. We herein reveal for each of the 7 methods, 

the effect of the input space dimension p on the average test error. 


Table (3) clearly reveals that PP does not work well n multi-categorical classification. With the number of classes 
just equal to 3, PP yields the worst predictive performance, regardless of the value of the overall correlation 
among the variables. As we discuss much later, PP typically performs well in binary classification when p 
is relatively large. But clearly, as depicted in Table (3), PP, at least in the implementation used here does 
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Fig 2. (left) Average test error on the uncontaminated simulated data with g = 3 and p = 0.75. We herein reveal for each of the 
7 methods, the effect of the input space dimension p on the average test error, (center) Average test error on the uncontaminated 
simulated data with g = 2 and p = 0. We herein reveal for each of the 7 methods, the effect of the input space dimension p on the 
average test error, (right) Average test error on the uncontaminated simulated data with g = 2 and p = 0.75. We herein reveal for 
each of the 7 methods, the effect of the input space dimension p on the average test error. 


NOT handle multiclass tasks well, even when the data is potentially representable in a lower dimensional 
space (large p). It is important to emphasize that this behavior of PP noticed here on uncontaminated data 
carries over to contaminated at various rates of contamination. Figure (2) clearly shows different scenarios of 
predictive performances under different data characteristics when the data has no contamination. The left panel 
corresponds to the case when the number of classes is 3 and the correlation among variables is very high. In this 
case, RF and SIMCA emerge as the best whereas PP fails. In the center, we see another excellent performance 
of RF. wth DDA emerging as the best while PP fails, this time due to the fact that p = 0. The right panel 
highlight the ideal conditions for PP, namely binary classification along with large correlation. 

4-4-%- Effect of Mild Contamination 

We now consider the performances of the techniques under a mildly contaminated regime, i.e. e = 0.05. Our first 
simulation on under this regime looks at combination where the number of classes is g = 2 and than investigate 
the effect of p and p (input space dimension) and n. Figure (3) reveals what we anticipated earlier, namely that 
PP performs well, typically emerging as the best predictive technique, when g=2 (binary classification) and p 
is large (intrinsically lower dimensionality of the data). In fact, under these two PP-favorable conditions, PP 
yields the best performance regardless of the size n of contamination. 



Fig 3. Average test error on the mild contaminated simulated data with g — 2 and p = 0.75. We herein reveal for each of the 7 
methods, the effect of the input space dimension p on the average test error. 
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Fig 4. Average test error on the mild contaminated simulated data with g = 2 and p = 0.25. We herein reveal for each of the 7 
methods, the effect of the input space dimension p on the average test error. 



Fig 5. Average test error on the mild contaminated simulated data with g = 2 and p = 0.0. We herein reveal for each of the 7 
methods, the effect of the input space dimension p on the average test error. 



Fig 6. Average test error on the mild contaminated simulated data with g = 3. We herein reveal for each of the 7 methods, the 
effect of the input space dimension p on the average test error. 
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Also noteworthy here is the fact that DDA fails for all the values of n when p is large. Finally, we also notice 
that RF does reasonably well, while SIMCA gets progressively worse as k increases. 

C4-3. Effect of Strong Contamination 

We now consider the performances of the techniques under a strongly contaminated regime, i.e. e = 0.15. Our 
first simulation under this regime looks at combination where the number of classes is g = 2 and then investigates 
the effect of p and p (input space dimension) and k. 



Fig 7. Average test error on the strongly contaminated simulated data with g = 2 and p = 0.75. We herein reveal for each of the 
7 methods, the effect of the input space dimension p on the average test error. 



Fig 8. Average test error on the strongly contaminated simulated data with g = 2 and p = 0.25. We herein reveal for each of the 
7 methods, the effect of the input space dimension p on the average test error. 


We see in Figure (10) that under the strong contamination regime, RF emerges as the best in multi-categorical 
classification tasks, regardless of the correlation level and the size k of contamination. SIMCA also performs 
very well under these conditions, typically taking the second place to RF in predictive performance. It can said 
that overall, RF and SIMCA are the most practical and applicable of all the techniques explored here, since 
they both do well for the realistic scenario of multicategorical classification with some rate of contamination. 

5. Conclusion and Discussion 

We have presented a thorough comparison of the predictive performances of several robust classification methods 
on high dimension low sample size data. On both real life and simulated data, interesting patterns emerged. We 
noted for instance that the SIMCA method, by being somewhat very general tends to yield mediocre predictive 
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Tost Error(p) 




Test Error(p) 





Fig 9. Average test error on the strongly contaminated simulated data with g — 2 and p = 0.00. We herein reveal for each of the 
7 methods, the effect of the input space dimension p on the average test error. 





Fig 10. Average test error on the strong contaminated simulated data with g — 3. We herein reveal for each of the 7 methods, the 
effect of the input space dimension p on the average test error. 
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performances when p is much larger than n, even though it rarely yield the worst among compared classification 
techniques. One of the most striking remarks in our study has to do with projection pursuit, the clearest being 
the fact that projection pursuit seems to do well only in binary classification. As a matter of fact, for all the 
scenarios involving more than two classes, projection pursuit seems to fail miserably regardless of all the other 
aspects of the data. Strikingly also, when there are only two classes, projection pursuit yields the best predictive 
performance if the correlation among the input space variables is large. This leads us to conclude that projection 
pursuit as a method for robust discriminant analysis is - at least in its present form - only best suited to binary 
classification for data whose intrinsic dimensionality is very low. For us, the most striking observation lies with 
the performance of random forest. Indeed, as can be noted in all the computational results presented earlier, 
random forest tended to be the best overall. More precisely, there was no instance where random forest yielded 
the worst performance, and in most cases, it was either the very best or the second best. As we explained 
earlier, this can be explained by the very mechanism of random forest in the sense that at every iteration of the 
construction of a random forest, not only is the estimator based on a subset of input variables, but also crucially 
the bootstrap mechanism leaves out a proportion e -1 of the sample. This left out fraction certainly contains some 
of the outliers. It is our conjecture as indicated earlier, that the fact of leaving out a fraction of the data allows 
random forest to weed out outliers or at least average out their effect. Hence the inherent ability of random forest 
to achieve robustness by random subsampling. One could conjecture that the overall superior performance of 
random forest can be attributed to the fact it does both variable selection (by random subspace learning) thereby 
inherently addressing the extremely high dimensionality of the data, and also reduction (or even elimination) 
of the effect of outliers by subsampling. There is a sense here of a connection - however loose - between the 
subset selection of minimum covariance determinant (MCD) - recall that MCD select h < n observations that 
yield the minimum covariance determinant - and the out-of-bag observations derived from the bootstrap in 
random forest. Therefore, we intend to investigate further the relationship between the fraction/proportion e~ x 
of random forest and the number h of observations used the MCD. We are also currently exploration various 
strategies of regularized MCD as a way to achieve robust classification in settings where n is much less than p. 
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